89 research outputs found

    On Randomly Projected Hierarchical Clustering with Guarantees

    Hierarchical clustering (HC) algorithms are generally limited to small data instances due to their runtime costs. Here we mitigate this shortcoming and explore fast HC algorithms based on random projections for single (SLC) and average (ALC) linkage clustering as well as for the minimum spanning tree problem (MST). We present a thorough adaptive analysis of our algorithms that improve prior work from $O(N^2)$ by up to a factor of $N/(\log N)^2$ for a dataset of $N$ points in Euclidean space. The algorithms maintain, with arbitrarily high probability, the outcome of hierarchical clustering as well as the worst-case running-time guarantees. We also present parameter-free instances of our algorithms. Comment: This version contains the conference paper "On Randomly Projected Hierarchical Clustering with Guarantees", SIAM International Conference on Data Mining (SDM), 2014 and, additionally, proofs omitted in the conference version.

    Compressive Mining: Fast and Optimal Data Mining in the Compressed Domain

    Real-world data typically contain repeated and periodic patterns. This suggests that they can be effectively represented and compressed using only a few coefficients of an appropriate basis (e.g., Fourier, Wavelets, etc.). However, distance estimation when the data are represented using different sets of coefficients is still a largely unexplored area. This work studies the optimization problems related to obtaining the tightest lower/upper bound on Euclidean distances when each data object is potentially compressed using a different set of orthonormal coefficients. Our technique leads to tighter distance estimates, which translates into more accurate search, learning and mining operations directly in the compressed domain. We formulate the problem of estimating lower/upper distance bounds as an optimization problem. We establish the properties of optimal solutions, and leverage the theoretical analysis to develop a fast algorithm to obtain an exact solution to the problem. The suggested solution provides the tightest estimation of the $L_2$-norm or the correlation. We show that typical data-analysis operations, such as k-NN search or k-Means clustering, can operate more accurately using the proposed compression and distance reconstruction technique. We compare it with many other prevalent compression and reconstruction techniques, including random projections and PCA-based techniques. We highlight a surprising result, namely that when the data are highly sparse in some basis, our technique may even outperform PCA-based compression. The contributions of this work are generic as our methodology is applicable to any sequential or high-dimensional data as well as to any orthogonal data transformation used for the underlying data compression scheme. Comment: 25 pages, 20 figures, accepted in VLD
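
To make the idea of distance bounds in the compressed domain concrete, here is a minimal Python sketch of a much simpler special case than the one the abstract addresses: both objects keep the same coefficient positions of an orthonormal transform, plus the energy of the coefficients they dropped. The bounds follow from the Cauchy-Schwarz inequality. This is not the paper's optimal algorithm for differing coefficient sets; the names `compress` and `distance_bounds` are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, Iterable, Sequence, Tuple

Compressed = Tuple[Dict[int, float], float]


def compress(coeffs: Sequence[float], kept: Iterable[int]) -> Compressed:
    """Keep the coefficients at positions `kept` and the energy of the rest.

    `coeffs` is assumed to hold the coefficients of an orthonormal transform,
    so Euclidean distances are preserved in the coefficient domain.
    """
    kept_values = {i: coeffs[i] for i in kept}
    residual_energy = sum(
        v * v for i, v in enumerate(coeffs) if i not in kept_values
    )
    return kept_values, residual_energy


def distance_bounds(cx: Compressed, cy: Compressed) -> Tuple[float, float]:
    """Lower and upper bounds on the Euclidean distance of the originals.

    Assumes both objects kept the same coefficient positions.
    """
    (kx, ex), (ky, ey) = cx, cy
    if kx.keys() != ky.keys():
        raise ValueError("this sketch requires identical kept positions")
    # Exact contribution of the coefficients both objects retained.
    shared = sum((kx[i] - ky[i]) ** 2 for i in kx)
    # The dropped parts contribute between (sqrt(ex) - sqrt(ey))^2 and
    # (sqrt(ex) + sqrt(ey))^2 by the Cauchy-Schwarz inequality.
    lower = math.sqrt(shared + (math.sqrt(ex) - math.sqrt(ey)) ** 2)
    upper = math.sqrt(shared + (math.sqrt(ex) + math.sqrt(ey)) ** 2)
    return lower, upper


if __name__ == "__main__":
    x = [5.0, 1.0, 2.0, 2.0]
    y = [4.0, 1.0, 0.0, 1.0]
    low, high = distance_bounds(compress(x, [0, 1]), compress(y, [0, 1]))
    true = math.dist(x, y)
    print(low, true, high)
```

The true distance always lies between the two returned values; the tighter the bounds, the fewer full-resolution objects a k-NN search would need to inspect.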

    Approximate Matrix Multiplication with Application to Linear Embeddings

    In this paper, we study the problem of approximately computing the product of two real matrices. In particular, we analyze a dimensionality-reduction-based approximation algorithm due to Sarlos [1], introducing the notion of nuclear rank as the ratio of the nuclear norm over the spectral norm. The presented bound has improved dependence with respect to the approximation error (as compared to previous approaches), whereas the subspace on which we project the input matrices has dimensions proportional to the maximum of their nuclear rank and is independent of the input dimensions. In addition, we provide an application of this result to linear low-dimensional embeddings. Namely, we show that any Euclidean point-set with bounded nuclear rank is amenable to projection onto a number of dimensions that is independent of the input dimensionality, while achieving additive error guarantees. Comment: 8 pages, International Symposium on Information Theor
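
The general idea of approximating a matrix product through dimensionality reduction can be sketched as follows. This is a minimal illustration, assuming a dense Gaussian projection matrix with entries of variance 1/k; the abstract does not specify the projection used, and the helper names (`approx_product`, `random_projection`) are hypothetical. Both inputs are projected with the same random matrix S, and the product of the projected matrices stands in for the exact product.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain dense matrix product of x (r x n) and y (n x c)."""
    cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def transpose(x: Matrix) -> Matrix:
    return [list(row) for row in zip(*x)]


def frobenius(x: Matrix) -> float:
    return math.sqrt(sum(v * v for row in x for v in row))


def random_projection(k: int, n: int, rng: random.Random) -> Matrix:
    """k x n matrix with i.i.d. Gaussian entries of variance 1/k (an assumption)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(k)]


def approx_product(a: Matrix, b: Matrix, k: int, seed: int = 0) -> Matrix:
    """Approximate a^T b for a (n x p) and b (n x q) as (S a)^T (S b)."""
    rng = random.Random(seed)
    s = random_projection(k, len(a), rng)
    return matmul(transpose(matmul(s, a)), matmul(s, b))


if __name__ == "__main__":
    rng = random.Random(1)
    a = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(50)]
    b = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(50)]
    exact = matmul(transpose(a), b)
    approx = approx_product(a, b, k=400)
    diff = [[e - g for e, g in zip(r1, r2)] for r1, r2 in zip(exact, approx)]
    print(frobenius(diff) / (frobenius(a) * frobenius(b)))
```

In this toy run the projection dimension k exceeds the number of rows, so nothing is saved; the point is only to show the estimator. A saving appears when k is much smaller than the number of rows, which is the regime the abstract's nuclear-rank bound concerns.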

    Scalable and interpretable product recommendations via overlapping co-clustering

    We consider the problem of generating interpretable recommendations by identifying overlapping co-clusters of clients and products, based only on positive or implicit feedback. Our approach is applicable to very large datasets because it exhibits almost linear complexity in the input examples and the number of co-clusters. We show, both on real industrial data and on publicly available datasets, that the recommendation accuracy of our algorithm is competitive with that of state-of-the-art matrix factorization techniques. In addition, our technique has the advantage of offering recommendations that are textually and visually interpretable. Finally, we examine how to implement our technique efficiently on Graphical Processing Units (GPUs). Comment: In IEEE International Conference on Data Engineering (ICDE) 201

    Adaptive coarse-grained Monte Carlo simulation of reaction and diffusion dynamics in heterogeneous plasma membranes

    Background: An adaptive coarse-grained (kinetic) Monte Carlo (ACGMC) simulation framework is applied to reaction and diffusion dynamics in inhomogeneous domains. The presented model is relevant to the diffusion and dimerization dynamics of epidermal growth factor receptor (EGFR) in the presence of plasma membrane heterogeneity and specifically receptor clustering. We perform simulations representing EGFR cluster dissipation in heterogeneous plasma membranes consisting of higher density clusters of receptors surrounded by low population areas using the ACGMC method. We further investigate the effect of key parameters on the cluster lifetime. Results: Coarse-graining of dimerization, rather than of diffusion, may lead to computational error. It is shown that the ACGMC method is an effective technique to minimize error in diffusion-reaction processes and is superior to the microscopic kinetic Monte Carlo simulation in terms of computational cost while retaining accuracy. The low computational cost enables sensitivity analysis calculations. Sensitivity analysis indicates that it may be possible to retain clusters of receptors over the time scale of minutes under suitable conditions and the cluster lifetime may depend on both receptor density and cluster size. Conclusions: The ACGMC method is an ideal platform to resolve large length and time scales in heterogeneous biological systems well beyond the plasma membrane and the EGFR system studied here. Our results demonstrate that cluster size must be considered in conjunction with receptor density, as they synergistically affect EGFR cluster lifetime. Further, the cluster lifetime being of the order of several seconds suggests that any mechanisms responsible for EGFR aggregation must operate on shorter timescales (at most a fraction of a second), to overcome dissipation and produce stable clusters observed experimentally. © 2010 Collins et al; licensee BioMed Central Ltd

    Recurrent Urinary Tract Infections due to Asymptomatic Colonic Diverticulitis

    Colovesical fistula is a common complication of diverticulitis. Pneumaturia, fecaluria, urinary tract infections, abdominal pain, and dysuria are commonly reported. The authors report a case of colovesical fistula due to asymptomatic diverticulitis, and they emphasize the importance of thoroughly investigating recurrent urinary tract infections that occur without any bowel symptoms. They also briefly review the literature.

    A study on implementing a multithreaded version of the SIRENE detector simulation software for high energy neutrinos

    The primary objective of SIRENE is to simulate the response to neutrino events of any type of high energy neutrino telescope. Additionally, it implements different geometries for a neutrino detector and different configurations and characteristics of photo-multiplier tubes (PMTs) inside the optical modules of the detector through a library of C++ classes. This could be considered a massive statistical analysis of photo-electrons. The aim of this work is the development of a multithreaded version of the SIRENE detector simulation software for high energy neutrinos. This approach allows utilization of multiple CPU cores, leading to a potentially significant decrease in the required execution time compared to the sequential code. We make use of the OpenMP framework for the production of multithreaded code running on the CPU. Finally, we analyze the feasibility of a GPU-accelerated implementation.

    Discovering similar multidimensional trajectories

    We investigate techniques for analysis and retrieval of object trajectories in a two- or three-dimensional space. Such data usually contain a great amount of noise, which makes all previously used metrics fail. Therefore, here we formalize non-metric similarity functions based on the Longest Common Subsequence (LCSS), which are very robust to noise and furthermore provide an intuitive notion of similarity between trajectories by giving more weight to the similar portions of the sequences. Stretching of sequences in time is allowed, as well as global translation of the sequences in space. Efficient approximate algorithms that compute these similarity measures are also provided. We compare these new methods to the widely used Euclidean and Time Warping distance functions (for real and synthetic data) and show the superiority of our approach, especially under the strong presence of noise. We prove a weaker version of the triangle inequality and employ it in an indexing structure to answer nearest neighbor queries. Finally, we present experimental results that validate the accuracy and efficiency of our approach.
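
An LCSS-style similarity between two trajectories can be sketched with a short dynamic program. This is a minimal illustration under stated assumptions, not the authors' implementation: two points are taken to match when every coordinate differs by less than `epsilon` and their indices differ by at most `delta`, and the score is normalised by the shorter trajectory's length. The parameter names and the normalisation are assumptions, and the global translation mentioned in the abstract is not handled.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def lcss_length(a: Sequence[Point], b: Sequence[Point],
                epsilon: float, delta: int) -> int:
    """Length of the longest common subsequence under a matching tolerance."""
    n, m = len(a), len(b)
    # dp[i][j] holds the LCSS length of the prefixes a[:i] and b[:j].
    dp: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            close_in_space = all(
                abs(p - q) < epsilon for p, q in zip(a[i - 1], b[j - 1])
            )
            close_in_time = abs(i - j) <= delta
            if close_in_space and close_in_time:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]


def lcss_similarity(a: Sequence[Point], b: Sequence[Point],
                    epsilon: float, delta: int) -> float:
    """LCSS length normalised to [0, 1] by the shorter trajectory."""
    if not a or not b:
        return 0.0
    return lcss_length(a, b, epsilon, delta) / min(len(a), len(b))


if __name__ == "__main__":
    clean = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
    # Same path with small jitter and one gross outlier at the third sample.
    noisy = [(0.0, 0.1), (1.0, 1.1), (50.0, 50.0), (3.0, 3.1)]
    print(lcss_similarity(clean, noisy, epsilon=0.5, delta=1))  # 0.75
```

The outlier simply goes unmatched and costs one point of similarity, whereas a Euclidean distance over the same pair would be dominated by it. This is the robustness to noise the abstract refers to.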

    Identifying Similarities, Periodicities and Bursts for Online Search Queries

    We present several methods for mining knowledge from the query logs of the MSN search engine. Using the query logs, we build a time series for each query word or phrase (e.g., ‘Thanksgiving’ or ‘Christmas gifts’) where the elements of the time series are the number of times that a query is issued on a day. All of the methods we describe use sequences of this form and can be applied to time series data generally. Our primary goal is the discovery of semantically similar queries and we do so by identifying queries with similar demand patterns. Utilizing the best Fourier coefficients and the energy of the omitted components, we improve upon the state-of-the-art in time-series similarity matching. The extracted sequence features are then organized in an efficient metric tree index structure. We also demonstrate how to efficiently and accurately discover the important periods in a time series. Finally, we propose a simple but effective method for identification of bursts (long or short-term). Using the burst information extracted from a sequence, we are able to efficiently perform ‘query-by-burst’ on the database of time series. We conclude the presentation with the description of a tool that uses the described methods, and serves as an interactive exploratory data discovery tool for the MSN query database.
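
One common way to look for important periods in a daily query-count series is to inspect its periodogram and pick the frequency with the most power. The sketch below is a generic illustration of that idea, not the method described in the abstract; it uses a direct O(n^2) discrete Fourier transform from the standard library for clarity, and the function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def power_spectrum(series: Sequence[float]) -> List[Tuple[int, float]]:
    """Periodogram of a mean-centred series for frequencies 1 .. n // 2."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    spectrum = []
    for k in range(1, n // 2 + 1):
        # Direct DFT coefficient at frequency index k.
        coeff = sum(
            v * cmath.exp(-2j * math.pi * k * t / n)
            for t, v in enumerate(centered)
        )
        spectrum.append((k, abs(coeff) ** 2 / n))
    return spectrum


def dominant_period(series: Sequence[float]) -> float:
    """Period (in samples) of the frequency carrying the most power."""
    k, _ = max(power_spectrum(series), key=lambda item: item[1])
    return len(series) / k


if __name__ == "__main__":
    # Eight weeks of synthetic daily counts with a weekly cycle.
    weekly = [10 + 5 * math.sin(2 * math.pi * t / 7) for t in range(56)]
    print(dominant_period(weekly))  # 7.0
```

On real query logs the spectrum is noisy, so a practical detector would also need a threshold to decide which peaks are significant; that step is omitted here.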